MSR14 Comparisons of Encoding Techniques for Categorical Features in Linear Regression Models
Authors
Abstract
In healthcare research, data contain many types of categorical variables, such as race, country, and zip code, with anywhere from a low to a high number of levels in each category. However, these features need to be encoded into numeric form before machine learning algorithms can be applied. Therefore, it is critical to find a suitable encoding method for coefficient estimation and prediction. In this study, we investigate three commonly used encoding methods and compare them in estimation, feature selection, and prediction in linear regression analysis. We compare label encoding, one-hot encoding, and target leave-one-out encoding on low-level (n_level=5) and high-level (n_level=50) categorical variables under balanced and unbalanced synthetic designs. We apply different algorithms (ordinary least squares (OLS), Bayesian ridge regression, and logistic regression) on synthetic datasets in the continuous-outcome and binary classification settings. In low-level settings with continuous outcomes, all methods can identify the true important features, and [...] achieves the smallest mean absolute error (MAE) in both coefficients and prediction. For classification, [...] fails in detecting the important features, with accuracy around 50%. In the OLS scenario, the [...] derived [...] shift far away from the true value, especially in imbalanced [...]. In high-level settings (n_level=50), [...] outperforms the other two in MAE and is stable in feature selection. One-hot encoding has relatively [...]; however, it could not [...] estimations are stable. Label encoding [...] able [...] largest MAE. Target leave-one-out encoding [...] traditional [...] in terms of performance, [...] categories.
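As a rough illustration of the three encoders named in the abstract, the following pure-Python sketch applies each one to a tiny toy column. It is not the authors' implementation: the function names, the toy data, and the fallback to the global mean for categories with a single row are all assumptions made here for illustration.

```python
from collections import defaultdict


def label_encode(values):
    """Map each category to an integer index (categories sorted alphabetically)."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def one_hot_encode(values):
    """Build one 0/1 indicator column per category (no reference level dropped)."""
    cats = sorted(set(values))
    return [[1 if v == cat else 0 for cat in cats] for v in values]


def target_loo_encode(values, targets):
    """Replace each row's category with the mean target of the OTHER rows in it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    global_mean = sum(targets) / len(targets)
    encoded = []
    for v, y in zip(values, targets):
        if counts[v] > 1:
            # Leave the current row's own target out of the category mean.
            encoded.append((sums[v] - y) / (counts[v] - 1))
        else:
            # Assumption: a category seen only once falls back to the global mean.
            encoded.append(global_mean)
    return encoded


# Hypothetical toy data: one categorical feature and a continuous outcome.
category = ["a", "b", "a", "c", "b", "a"]
outcome = [1.0, 2.0, 3.0, 4.0, 6.0, 5.0]

labels = label_encode(category)
one_hot = one_hot_encode(category)
loo = target_loo_encode(category, outcome)
```

Label encoding yields a single integer column and so imposes an arbitrary ordering on the levels, one-hot encoding yields one column per level, and leave-one-out target encoding yields a single numeric column regardless of the number of levels, which is why the column count differs sharply between the n_level=5 and n_level=50 settings.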
Similar resources
Categorical linear models
Suppose that we have a continuous random variable Y and a categorical variable C with levels c_1, [...]
• Are all of these random variable means the same? In other words, is it true that µ_1(Y) = µ_2(Y) = ··· = µ(Y)?
• For which pairs of levels c_i, c_j are the associated random variable means µ_i(Y), µ_j(Y) equal?
• For each pair of levels c_i, c_j, what is the difference µ_j(Y) − µ_i(Y)...
Robust Estimation in Linear Regression with Multicollinearity and Sparse Models
One of the factors affecting the statistical analysis of the data is the presence of outliers. The methods which are not affected by the outliers are called robust methods. Robust regression methods are robust estimation methods of regression model parameters in the presence of outliers. Besides outliers, the linear dependency of regressor variables, which is called multicollinearity...
Some Forecast Methods in Regression Models for Categorical Time Series
We are dealing with the prediction of forthcoming outcomes of a categorical time series. We will assume that the evolution of the time series is driven by a covariate process and by former outcomes, and that the covariate process itself obeys an autoregressive law. Two forecasting methods are presented. The first is based on an integral formula for the probabilities of forthcoming events and by a Mon...
Inferential models for linear regression
Linear regression is arguably one of the most widely used statistical methods. However, important problems, especially variable selection, remain a challenge for classical modes of inference. This paper develops a recently proposed framework of inferential models (IMs) in the linear regression context. In general, the IM framework is able to produce meaningful probabilistic summaries of the sta...
Chain graph models of multivariate regression type for categorical data
Abstract: We discuss a class of graphical models for discrete data defined by what we call a multivariate regression chain graph Markov property. We propose a parameterization based on a sequence of generalized linear models with a multivariate logistic link function. We show the relationship with a chain graph model recently defined in the literature, and we prove that the proposed parametriza...
Journal
Journal title: Value in Health
Year: 2022
ISSN: 1098-3015, 1524-4733
DOI: https://doi.org/10.1016/j.jval.2022.04.1221